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Abstract 

Background: DNA shuffling generates combinatorial libraries of chimeric genes by stochastically recombining 
parent genes. The resulting libraries are subjected to large-scale genetic selection or screening to identify those 
chimeras with favorable properties (e.g., enhanced stability or enzymatic activity). While DNA shuffling has been 
applied quite successfully, it is limited by its homology-dependent, stochastic nature. Consequently, it is used only 
with parents of sufficient overall sequence identity, and provides no control over the resulting chimeric library. 

Results: This paper presents efficient methods to extend the scope of DNA shuffling to handle significantly more 
diverse parents and to generate more predictable, optimized libraries. Our CODNS (cross-over optimization for DNA 
shuffling) approach employs polynomial-time dynamic programming algorithms to select codons for the parental 
amino acids, allowing for zero or a fixed number of conservative substitutions. We first present efficient algorithms 
to optimize the local sequence identity or the nearest-neighbor approximation of the change in free energy upon 
annealing, objectives that were previously optimized by computationally-expensive integer programming methods. 
We then present efficient algorithms for more powerful objectives that seek to localize and enhance the frequency 
of recombination by producing "runs" of common nucleotides either overall or according to the sequence diversity 
of the resulting chimeras. We demonstrate the effectiveness of CODNS in choosing codons and allocating 
substitutions to promote recombination between parents targeted in earlier studies: two GAR transformylases (41% 
amino acid sequence identity), two very distantly related DNA polymerases, Pol X and /3 (15%), and beta- 
lactamases of varying identity (26-47%). 

Conclusions: Our methods provide the protein engineer with a new approach to DNA shuffling that supports 
substantially more diverse parents, is more deterministic, and generates more predictable and more diverse 
chimeric libraries. 



Background 

The harnessing of DNA recombination in vitro has trans- 
formed protein engineering by enabling engineers, like 
nature, to sample sequence space more broadly than is 
allowed by point mutagenesis at individual residues. 
Recombination produces chimeras comprised of sequen- 
tial fragments from parent genes, thereby bringing 
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together sets of sequences that were previously active in 
the parental background, and are thus likely to be less dis- 
ruptive than random ones. Chimeragenesis typically pro- 
duces combinatorial libraries, and those chimeras with 
beneficial properties can be identified by large-scale 
genetic screening and selection. 

DNA shuffling [1,2], the progenitor of recombination- 
based protein engineering, works by randomly digesting 
the parent genes into fragments and reassembling the 
fragments into new chimeric genes (Figure 1). Recombi- 
nation occurs when fragments from different parents are 
sufficiently complementary to anneal and prime synthesis 
from the 3' end. DNA shuffling has been called by its 
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Figure 1 Basic steps in gene shuffling protocol (following [30]). (1) Parental genes are stochastically fragmented. (2) The fragments are 
denatured, and strands with sufficient complementarity are annealed and extended. Cross-overs are formed when the complementary strands 
are from different parents and can be extended to complete fragments. The process is repeated for multiple rounds, generating additional 
fragments and cross-overs. (3) Ultimately a chimeric library is generated, some of whose members represent full-length genes, as shown. 



developer Pim Stemmer "the most dangerous thing you 
can do in biology" [3], due to its power in generating 
novel proteins. Indeed, it has been the basis both for 
commercial success (Affymax, Maxygen) and the devel- 
opment of effective protein variants [4-7] . 

DNA shuffling is both homology-dependent (recombina- 
tion can occur only in runs of similar DNA sequence), and 
stochastic (the engineer does not control the recombina- 
tion sites). Due to dependence on sequence similarity, 
DNA shuffling may fail to generate desirable chimeras (or 
any chimeras at all) for diverse parents, as they have only a 
few, small regions of DNA similarity, insufficient to gener- 
ate many cross-overs. Homology-independent stochastic 
methods (e.g., ITCHY [8] and SHIPREC [9]) mitigate the 
need for such parental sequence similarity, but at the cost 
of generating many more non-viable chimeras. 

In contrast with stochastic methods, site-directed 
methods enable the engineer to explicitly choose break- 
point locations so as to optimize expected library quality 
(e.g., by employing structural information [10], or by 
minimizing predicted disruption [11,12], library diversity 
[13], or both factors [14]). We have developed a site- 
directed method employing planned ligation of parental 
fragments by short overhangs [15]. We have coupled this 
approach to robotic implementation in order to generate 
specific chimeras in defined experimental vessels [16]. 
Such highly-directed methods of chimera generation are 
most useful when screening represents a significant 
effort. In those situations where screening or genetic 
selection is readily available, then stochastic approaches, 
with less overall cost, might prove preferable. 

We present here methods for extending stochastic 
experiments by optimizing DNA shuffling (Figure 2), 
yielding an approach that is less dependent on parental 
DNA sequence similarity (parents can be more diverse) 
and more deterministic (cross-overs are more predictable), 
and which is amenable to library optimization. Our 
approach, which we call CODNS (cross-over optimization 
for DN A shuffling), employs efficient (polynomial-time) 
dynamic programming algorithms to select a globally opti- 
mal set of codons for the parental amino acids, allowing 
for a fixed number of substitutions. While Moore and 



Maranas have also studied the problem of codon optimi- 
zation for shuffling [17], their eCodonOpt method 
employs computationally-expensive integer programming 
to select codons. We present dynamic programming 
recurrences for the two crossover-maximization objective 
functions of eCodonOpt: overall DNA sequence identity 
and overall free energy of annealing as approximated by a 
nearest-neighbor potential. We then develop recurrences 
for two more powerful objectives that seek to maximize 
crossovers by promoting DNA sequence identity within 
contiguous runs, optimizing either the overall number of 
runs or the diversity of the chimera library resulting from 
breakpoints in the contiguous runs. 

We demonstrate the effectiveness of CODNS in sev- 
eral case studies. We first optimize the GAR transformy- 
lases previously optimized by eCodonOpt [17]. We then 
show that CODNS can optimize two DNA polymerases 
(Pol X and Pol /3) that are sufficiently diverse (15% 
amino acid sequence identity) to previously require the 
development and application of the SCOPE method 
[10], instead of direct application of DNA shuffling. 
Finally, we study the impact of parental sequence iden- 
tity by considering pairs of beta-lactamase parents of 
differing diversity levels. 

Methods 

We take as input the amino acid sequences of the par- 
ent proteins to be shuffled, aligned to a length of n 
(amino acids and gaps) based on sequence and/or struc- 
ture. For simplicity of exposition, we present our meth- 
ods for the most common case of shuffling two parents, 
ai and a 2 . Our methods readily extend to creating 
equivalent sites for recombination in multiple parents, 
and it remains interesting future work to allow for non- 
uniform shuffling (i.e., where different cross-overs are 
possible between different pairs of parents). 

To optimize the shuffling experiment, we select a 
codon for each amino acid for each parent, yielding DNA 
sequences d x and d 2 of length 3n (maintaining gaps for 
those in the amino acid sequences). To expand the pool 
of codons being considered at a particular position, we 
may choose to make an amino acid substitution. Thus we 
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Figure 2 Schematic overview of CODNS. Here our method CODNS is applied to choose codons in a portion of two parental genes so as to 
produce a 9 nt "run" of common nucleotides, likely to be sufficient for a cross-over between complementary strands. To achieve the run, we 
use a combination of silent DNA substitutions (underlined) as well as a conservative amino acid substitution (boxed). The implications of these 
choices can be global; e.g., TTC for F and CTC for L would end the current run at 7 nt, but provide the first 2 nt of a new one. Our dynamic 
programming algorithm finds the globally optimal solution. 



take as additional input a specification of the allowed 
substitutions for each residue position for each parent, 
along with a number m of them to make. The allowed 
substitution specification may be derived from sequence 
and/or structural analysis of the parents, including 
general amino acid substitution matrices [18], position- 
specific amino acid statistics from related proteins [19], 
and AAG^ old fold predictions for possible substitutions 
[20]. The results presented below determine allowed sub- 
stitutions under the BLOSUM62 substitution matrix, 
considering only "conservative" substitutions which score 
no more than 4 worse than wild-type [15]. 

In describing the algorithms, we use possible codon sets 
representing the codons allowed at each position in the 
wild-type and under the allowed substitutions. For posi- 
tion z, set Ci[z] contains the possible codons for ai[i], 
pairing each with an indication of whether or not it 
requires a substitution, e.g., {(TTT, 0), (TTC, 0), (TGG, 1)} 
for an F that could potentially be mutated to W. Set C 2 [i] 
is defined similarly for the second parent. We note that 
these may readily be used to restrict where to employ 
mutations (e.g., masking based on structural analysis, as 
discussed by Moore and Maranas [17]), by allowing only 
wild-type codons (or amino acids) in some positions. 

We consider four types of objective function, targeting 
common nucleotides (at aligned positions), nearest-neighbor 
approximation to change in free energy of annealing (from 
dinucleotide pairs), common nucleotide runs (in contiguous 
strings), or library diversity (among resulting chimeras). 
We develop increasingly more complex dynamic program- 
ming algorithms to optimize these objectives. 

Common nucleotide optimization 

In this most basic optimization for DNA shuffling, the 
goal is to maximize the number of identical nucleotides 
at common positions: 



0nt = Zl{di\i]=d 2 [t\} (1) 

i=i 

where / is the indicator function (1 for true, 0 for 
false). 

With no substitutions allowed, each residue position is 
independent of each other one. Thus we simply select 
for each position a pair of codons (one for each parent) 
with a maximal number of common nucleotides. When 
substitutions are allowed, we need to allocate them for 
optimal impact. While several approaches are possible, 
we develop here one based on dynamic programming, 
to serve as the basis for the more complex objective 
functions we pursue in subsequent subsections. 

In our dynamic programming matrix, one dimension 
represents an aligned residue position (i.e., we have opti- 
mized the sequences up to that point), and the other 
represents a number of substitutions (i.e., we have made 
that many thus far). Let N[i, s] denote the number of 
common nucleotides within the first i residues, using 
exactly s substitutions. The value of N[i, s] extends the 
value of N[i - 1, s - (t x + t 2 )] with the additional number 
of common nucleotides obtained by selecting a pair of 
codons for position i while making t x + t 2 additional 
substitutions (0 or 1 for each parent). Optimal substruc- 
ture holds, since the optimal value of N[i, s] depends on 
the optimal value of N[i - 1, s - (ti + t 2 )]. The recur- 
rence is 

N [0, 0] = 0 (2) 
N \0, s] = -oo 5 > 0 (3) 

N [i, s] = max (N [i - 1, s - (ti + t 2 )] + g(c\, c 2 )) i > 0, s > 0 

(ci,ti)eCi[i], ' (A\ 
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where g gives the number (0-3) of common nucleo- 
tides for a pair of codons. 

After filling in the dynamic programming table, we 
trace back from N[n, m] to generate an optimal pair of 
DNA sequences. The matrix is of size n * m and each 
cell takes constant time to compute. 

AG° n optimization 

While it is hard to directly model the process of DNA 
shuffling [21-23], it is driven by the change in free 
energy upon annealing of the different parental nucleo- 
tide strands (coding of one with non-coding of the 
other; see Figure 1). We want to minimize the free 
energy change, so that it is favorable to cross over. 
Since the free energy change is very hard to compute, a 
common approach is to approximate it by decomposing 
the free energy into the sum of contributions from pairs 
of dinucleotides, the nearest-neighbor approximation 
[24]: 

3n-l 

o nn = AG^(di[i] 1], d 2 [i\ -d 2 [i + 1]) (5) 

i=l 

where AG° n (di[i] • di[i + 1], d 2 [i] • d 2 [i + 1]) is the free 
energy change associated with annealing dinucleotide d 1 
[i]-di[i + 1] with dinucleotide d 2 [i]-d 2 [i + 1]. These values 
can be computed from enthalpic AH (kcal/mol) and 
entropic AS (cal/mol-K) nearest-neighbor parameters 
compiled at 37°C and [Na + ] = 1.0 M [24], including both 
pairs of complementary strands. To actually estimate the 
change in free energy, there are additional constant 
terms such as the average initiation energy contribution; 
we omit them as they do not affect the optimization. 
While the underlying AG° n parameters are defined on 
pairs of dinucleotides, we abuse notation a bit in our for- 
mulation below and use AG° n for 4-mers to mean the 
sum over the constituent dinucleotides. 

We now develop a dynamic programming formulation 
to optimize this objective more efficiently (in polynomial 
time) than the integer linear programming of eCodo- 
nOpt [17], while still ensuring global optimality. In 
order to compute the AG° n contributions from a 
selected codon, we must also know the final nucleotide 
of the previous codon, as it forms a dinucleotide with 
the first nucleotide of the current codon. Thus we 
extend the common nucleotide dynamic programming 
table to keep track of this information. Figure 3 (left) 
illustrates the dependency; the recurrence is 



min {A[i-\, s-{h+t 2 ), b\, //.J+AG^J^-Ci, b' r c 2 )) i > 0, s > 



(8) 



A [0,0, b lt b 2 ] = 0 

A[0, s, bi, b 2 ] = oo 5 > 0 



(6) 



(7) 



where b-c indicates the concatenation of base b onto 
codon c, and AG° n estimates the change in free energy, 
as described in the text. Cell A[i, s, bi, b 2 ] holds the 
best score for the first i positions, using exactly s substi- 
tutions, with third nucleotides b x (first parent) and b 2 
(second parent) for position L As with common nucleo- 
tide optimization, if a codon pair makes t x + t 2 substitu- 
tions at position i, then A[i, s, b lf b 2 ] extends the 
solution to a cell for position i - 1 with s - (ti + t 2 ) sub- 
stitutions, considering any of the third nucleotides b\ 
and b r 2 at position i - 1. 

The table is of size n x m x 5 2 for 2 parents, since 
there are only (4 + l) 2 combinations of single nucleotide 
pairs for two parents (four nucleotides and a gap each). 
Each cell can be computed in constant time. In practice, 
we construct a 2D table (over i and s), with each cell 
maintaining a list of scores for the (bi, b 2 ) pairs that 
actually occur. 

Run optimization 

Moore and Maranas argued that the nearest-neighbor 
approximation to change in annealing free energy is a bet- 
ter objective for shuffling optimization than the number of 
common nucleotides [17]. Intuitively, since the nearest- 
neighbor approximation considers adjacent nucleotides 
together rather than treating them independently, it is 
more likely to yield sufficient complementarity between 
fragments and thereby promote recombination. Here we 
go even further and explicitly optimize for contiguous 
complementary regions, since annealing is driven by suffi- 
ciently long (anecdotally 6 nt or more) such regions. 

We define a common nucleotide run as a maximal- 
length substring appearing at aligned positions in the 
DNA sequences d x and d 2) and use as our objective 
function: 



Orun = £/(|R|) 
R 



(9) 



where f, which must be non-decreasing, indicates the 
value for DNA shuffling of a run of length \R\, and the 
sum is taken over all runs. We have implemented and 
tested several different scoring functions; the results use 
the following two functions: 



/iW = 



h{r) = 



0 r < 0 
r r >6 

0 r < 6 

9/4 * (r - 5) 5 < r < 9 
r r > 9 



(10) 



(11) 
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Figure 3 Examples of dynamic programming recurrences. The top part of each example shows relationships in the table and the bottom part 
the score differences, (left) AG° n optimization, a nearest-neighbor approximation to the change in free energy of annealing, summing adjacent 
nucleotide pairs. We must keep track of the third nucleotides of the codons for position /' - 1 in order to compute the AG° n contribution, (right) 
Run optimization cases: common codon; common codon after substitution; continue and end a run; continue and end one run and start a new 
one; end one run and start a new one. The run score contribution is computed based on the pattern of nucleotide matches. 



In/i, we count the total number of nucleotides in a 
run, but only if the run exceeds a given length (we 
empirically evaluated several thresholds). This assumes 
that cross-overs are impossible for runs with fewer than 
6 common nucleotides, and become increasingly likely 
with additional nucleotides beyond 6, lnf 2 , we consider 
cross-overs impossible for fewer than 6 nucleotides and 
very likely for 9 nucleotides or more (scoring the total 
number of nucleotides as in/i), and we ramp up from 
the impossible score of 0 at 5 nt to the likely score of 9 
at 9 nt, thereby counting the partial benefit that may be 
provided by runs between 6 and 9 nucleotides. 

We must extend our dynamic programming table with 
an additional dimension to keep track of the current run 
length. Thus we have a table in which cell R[i, s, r] 
holds the best score for the first i positions, using 
exactly s substitutions, such that the final nucleotide in 
the codons chosen for position i is the rth in a run (0 if 
mismatch). Again, if we make t substitutions at position 
i, then R[i, s, r] extends the solution to a cell for posi- 
tion i - 1 with s - (ti + t 2 ) substitutions. Now we must 
also account for the preceding run length; there are sev- 
eral cases (Figure 3, right): the codons chosen for the 
current amino acid position may continue a run from 
the previous position, may end that run, and may start a 
new run. In any case, the current r and possible codon 
pair determines the preceding r at which to look, and 
optimal substructure still holds. The recurrence is thus 

£[0,0,01=0 (12) 
R [0, 5, r] = -oo 5 > 0 or r > 0 (13) 

IR[i-l, s-{t 1+ t 2 ), r -3]+/(r)-/(r-3) Ci = c 2 , r > 3 

+ f(r> + a( Cl , c 2 )) -/(r') + /(r)) r = z( Cl , c 2 ) < 3 V > 

— oo otherwise 

where a{ci, c 2 ) and z(ci, c 2 ) give the lengths of the long- 
est common prefix and suffix, respectively, of a pair of 



codons. The first case handles a common codon, while 
the second case handles an unequal codon pair, which 
may end and/or begin a run. The score depends on that 
from the related cell, with an increment in/(-) accounting 
for any extension in run length and initiation of a new 
run. (See again Figure 3, right.) When there is a tie, we 
prefer the codon pair with the most common nucleo- 
tides, even if that has no impact on run score. This 
choice increases overall sequence identity, to promote 
better annealing of strands from different parents. 

The matrix is of size n * m * (3n + 1), since the run 
length potentially ranges from 0 to the entire DNA 
sequence length (3n). However, in practical cases, most 
run lengths are not attainable. Furthermore, for r x < r 2 , 
if R[i, 5, rj + f[r 2 ) - flji) < R[U s> r 2 ], then the r 2 cell 
"dominates" the r± one-the r\ one cannot be part of the 
optimal solution. Thus we modify the usual dynamic 
programming algorithm slightly, to avoid filling in cells 
with unattainable or dominated run lengths. We per- 
form the standard nested loop over i (residue position) 
and s (number of substitutions). Then for each i and s 
we determine which run lengths are attainable and 
undominated and fill in only those entries. Rather than 
keeping a 3D table, we keep a 2D table in which each 
cell has a list of run lengths and their scores. Note from 
the structure of Eq. 14 that we can determine the run 
lengths for z, s from the possible codons at i and the 
run lengths that were attained and undominated for i - 
1 and 5, s - 1, and 5-2 (depending on the numbers of 
substitutions required for the codons). 

Diversity optimization 

We have previously developed methods for optimizing 
the diversity of libraries of chimeras produced by site- 
directed recombination [13,14]. We showed that the 
total number of mutations in a library is a constant 
determined only by the parents, but that by assessing 
the squared-differences in the numbers, we can optimize 
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for a relatively uniform sampling of sequence space. In 
the case of two parents, we define the diversity variance 
over a library as: 

2 A -1 2 k 

° dW = 2W - I) * ^ ^ (m(H " Hj) ~ ™ )2 (15) 

^ ' 1=1 j=i+l 

where A is the number of fragments, m(H it Hj) is the 
mutation level (number of amino acid differences) 
between a pair of chimeras H t and Hp and m is the average 
of m over the library. (We drop a constant factor of 2, 
which doesn't affect the optimization.) To mitigate the 
effect of neutral mutations, rather than using literal equal- 
ity we measure m using one of the standard sets of amino 
acid classes. The goal is to minimize the variance, seeking 
to sample sequence space as uniformly as possible. 

The objective function is defined in terms of the chi- 
meras in the library. In the context of DNA shuffling, we 
assume that a sufficiently large run of common nucleo- 
tides (with respect to a threshold 6 as in Eq. 10) results in 
a breakpoint, and thus that the (full-length) chimeras are 
well-defined as all combinations of fragments between 
the breakpoints. Breakpoints resulting from smaller runs 
only add to the diversity of the resulting library. 

For an efficient algorithm, we must be able to com- 
pute the objective function during the optimization, 
without enumerating the exponential number of chi- 
meras. In our previous site-directed work [13], we devel- 
oped a recursive formulation relating the diversity 
variance for a library to that of a sub-library with one 
fewer breakpoint. That formulation took as given the 
total number of breakpoints, which isn't available in the 
DNA shuffling context. However, similar algebraic 
manipulations (omitted due to lack of space) yield a 
related formula without requiring pre-knowledge of the 
number of breakpoints. 

Claim 1 The diversity variance d(l, k) of a library from 
parent sequences P a and P b with kth breakpoint is at 
residue I can be computed from the diversity variance d 
(/', k - 1) for a library with (k - 1) st breakpoint at resi- 
due r < I by the following formula: 

d(l, k) = |^ * d{f, ft - 1) + E(l, I', ft) (16) 
where 

E(l, X, fe) = 

2\*- 1) (m(Pa[1/ n m ' n) * m(Pa[r + h 11 Pb[t + h l]) 
+ m{P a [l> + l,l],P h [l> + l,l]f) 
2 k -2 2*m{P a [l,l'],P b [l,r]) 2 *2 2k ~ 6 
+ 2 k - 1 * (2^-1 - l) 2 

2*m(P a [U],P b [U]) 2 *2 2fe - 4 
(2 fe - l) 2 



and we use notation P[i, j] to indicate the substring 
from position i to j, inclusive. 

Based on Eq. 16, we further extend our run-length 
optimization dynamic programming recurrence to opti- 
mize for diversity: 

D [0, 0, 0, 0, 0] = 0 (17) 
D [0, 5, r, I, k] = 0 s > 0 or r > 0 or / > 0 or k > 0 (18) 

f D[i-l, s -(t 1 + t2 ), r-3, /, fe] 

-E(i- 1, I, k+l) + E{i, 1, k+ 1) Ci =c 2 , r> 3 

2 (k + i) _ 2 

min wwzr D\i-l,s-{t i+t2 ),r',t,k-l] 
D\i,s,r,l,k]= min f + a{ Cl , c 2 ) > 9, +E (i, i - 1, k * 1) / "1 Q\ 

(ci,ti)ed[i], l'<i-l r = z( Cl ,e 2 )<3,! = i-l 

(c 2 ,t 2 )eC 2 [i]: . D[i-1, s - (ti + fa), /, k] 

ii+( 2 <s r- +a ™"<f> -E(i- 1, /, k+l) + E(i, I, fe+1) r = z(d, c 2 )<3, /<i-l 

oo otherwise 

i > 0 

We add two more dimensions, to keep track of /<, how 
many runs of length 6 we have seen (i.e., confidently 
yielding breakpoints), and /, where the last one was, as 
in the claim. Intuitively these two additional dimensions 
are necessary since the number of breakpoints affects 
the size of the library and thus the diversity variance, 
and since the additional diversity induced by a run 
depends on the nucleotides between the previous break- 
point and the new one. Note that in Eq. 16, k is the 
number of breakpoints, with the last breakpoint always 
at the end of the current position /; however, in Eq. 19, 
k is the number of previous runs, or k + 1 when substi- 
tuted into Eq. 16. As with run optimization, our imple- 
mentation avoids filling in the table for run lengths that 
are unattainable (though the notion of dominated 
entries does not carry over). 

Codon usage 

In order to promote better protein expression, we follow 
the GeneDesigner protocol [25] in employing organism- 
specific codon usage tables. A codon usage table for an 
organism [26] encodes the frequency with which each 
codon has been observed in a sequence database; differ- 
ent organisms display different "preferences" [27]. In a 
preprocessing step, we disallow rare codons that make 
up less than 10% of the occurrences for their amino 
acid. Then when computing one of the recurrences, we 
use the codon usage table to resolve cases where multi- 
ple possible codons give the same score (i.e., they have 
the same implications for continuing, ending, and begin- 
ning runs). In such cases, we selecting among the possi- 
ble codons with probability according to their usage 
frequency. 

Results and discussion 

We use three case studies to demonstrate the effective- 
ness of CODNS in optimizing DNA shuffling experi- 
ments. The first two case studies are a pair of 
glycinamide ribonucleotide (GAR) transformylases 
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(previously optimized by eCodonOpt [17]) and a pair of 
distantly related DNA polymerases (previously recom- 
bined by SCOPE [10]). We optimize shuffling plans using 
from 0 to 10 mutations under each of the objective func- 
tions, abbreviated in the figures as cn (common nucleo- 
tides), AG (nearest-neighbor approximation to change in 
free energy of annealing),/! (runs under f x scoring), f 2 
(runs under f 2 scoring), and dv (library diversity). 
We examine particular plans optimized under different 
objectives, in order to see how they differ in allocating 
mutations and producing homologous runs suitable for 
cross-overs. We then study the overall trends in optimiz- 
ing the objectives and in producing runs. We also con- 
sider the diversity of the chimeras that would result by 
recombination under different run-optimal plans. Com- 
parisons with what would result eCodonOpt [17] can be 
made by noting that it optimizes cn and AG (though we 
use an efficient dynamic programming algorithm to do 
so). In a third case study, we evaluate the effects of wild- 
type sequence identity on the optimization, using differ- 
ent pairs of beta-lactamases. 

GAR transformylases 

The parents for our first case study are a GAR transfor- 
mylase from E. coli and one from humans. Previous 
work showed that DNA shuffling crossovers are extre- 
mely rare without codon optimization [17]. We obtained 
the (gapless) alignment from the supplementary material 
of [17], and transcribed it to 201 amino acids with 82 
(40.8%) in common. The wild-type DNA sequences had 
47% nucleotides in common [17], with only two runs of 
length 7 and no runs longer than 7 nt. 

Figure 4 illustrates some optimal plans, showing runs 
of length > 9, a relatively confident threshold for cross- 
overs (analogous observations can be made with other 
thresholds, not shown to save space). In some cases, 
many plans may be tied for optimal under the objective. 
We extended our dynamic programming back-trace to 
generate the tied solutions [28]; for common nucleo- 
tides, we enumerated all, but for AG° n , we stopped after 
1000 were generated. The figure then shows the tied- 
for-optimal plans with the most nucleotides in runs (at 
a threshold of 9). All objectives yield the same run pat- 
terns with 0 mutations. However, with more mutations, 
run and diversity optimization methods focus their 
efforts on regions that are sufficiently similar to allow 
formation of runs with well-placed mutations and well- 
chosen codons, while choices made by common nucleo- 
tide and AG° n optimization are not as productive. 
Diversity optimization produces the same number of 
runs as fx optimization, but places the runs more evenly 
throughout the entire sequence so that crossing over at 



those sites would yield chimeras comprised of more uni- 
formly-sized fragments better sampling the sequence 
space spanned by the parents. ("Size" in diversity optimi- 
zation refers to residues at which the parents differ, not 
just the total number of amino acids [13].) 

We next analyzed the overall ability of CODNS to 
select codons and allocate mutations to meet the differ- 
ent optimization goals. Figure 5 illustrates the objective 
score trends with increasing numbers of mutations. All 
of the plots are quite linear (recall that AG° n is to be 
minimized), demonstrating that there is sufficient free- 
dom within these two parents to enable effective optimi- 
zation for the objectives, and that the algorithms are 
successfully exploiting the available freedom. Since the/ 2 
metric gives "partial credit" for run lengths of 6, 7, and 8, 
we break out those contributions to its score. We see 
most of the optimization still focuses on full 9 nt and lar- 
ger runs, which is natural given the reduced score contri- 
bution for shorter runs (since they are believed to be less 
productive in promoting recombination). The trends for 
diversity optimization are not shown here since scores 
are not directly comparable for libraries of different sizes 
(resulting from different numbers of runs yielded by dif- 
ferent patterns of codons and mutations). 

We introduced novel run-based objective functions in 
order to more directly target the sufficient stretches of 
parental homology required for annealing, only indirectly 
optimized by the objectives of common nucleotides and 
AG° n employed by eCodonOpt. To assess the impact of 
this more direct objective functions, we determined the 
number of runs produced by plans under the different 
objectives. We varied the threshold to consider a homo- 
logous region as a "run" from 7 (lower confidence) to 12 
(higher confidence). As discussed above, among the plans 
tied for optimal for a particular objective, we sought the 
one with the best run score. We evaluated both the num- 
ber of runs and the number of nucleotides in those runs. 
For the sake of space, Figure 6 presents only the results 
for a threshold of 9; the trends at the other thresholds 
are very similar. We see that, as mutations are intro- 
duced, optimizing directly for runs is indeed much more 
effective at producing runs than either of the "proxies" of 
common nucleotides or AG° n . We do not show trends 
for f 2 , as it also optimizes for "partial credit" runs (of 
lengths 6, 7, and 8). 

The final question regards diversity-how much control 
we can exert over the level of diversity we introduce 
(how different the resulting chimeras in a plan are from 
the parents and from each other). Here we deem a 9 nt 
run as sufficient for a breakpoint, and evaluate the ability 
of CODNS to minimize our library diversity variance 
objective (Eq. 15) while maximizing the number of runs. 
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Figure 4 GAR transformylase detailed plans. Shown are the locations of selected mutations (vertical lines) and resulting runs of length > 9 
(boxes; staggered at some locations to avoid overlap), optimized under different objectives. For f 2l runs are colored by length, with 6:green, 7: 
black, 8:magenta, > 9:blue. For 0 mutations, all methods yielded the same run pattern, so only the f 2 version is shown. 



Figure 7 illustrates some plans with the same (optimal) 
number of runs but different library diversity scores. As 
discussed above, mutations in diversity-optimal plans are 
optimally allocated so as to create runs more evenly dis- 
tributed throughout the entire sequence (counting posi- 
tions with different amino acids in the parents). 
However, plans with larger diversity variance scores place 
runs closer together and leave the C-terminal portion 
without any runs, thereby generating no diversity there 



(the final 100+ residues will be from one parent or the 
other, rather than a hybrid). 

DNA polymerases 

Our second case study involves two distantly-related 
members from the X-family of DNA polymerases: Afri- 
can swine fever virus DNA polymerase X (Pol X) and 
Rattus norvegicus DNA polymerase beta (Pol /3). While 
these two proteins share a similar fold, they have very 
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had only 158/642 (24%) nucleotides in common, with no 
common nucleotide runs of length greater than 5. Thus 
standard DNA shuffling techniques are unlikely to pro- 
duce any cross-overs. 

We optimized these parents under each of our objec- 
tive functions, using from 0 to 10 mutations. Figure 8 
illustrates an optimal plan for each objective with 0, 4, 
or 8 mutations, showing runs of length > 9. These parti- 
cular parents are so diverse that only a few such runs 
can be produced by codon selection alone (no muta- 
tions). We do see, however, that the run-optimization 
methods form one more run (positions 148-157) than 
do the common nucleotide and AG° n methods, and f 2 
forms some potentially productive shorter runs. The 
difference increases with more mutations, as run 



optimization directly allocates them so as to produce 
more runs, while, due to the parental diversity, the 
indirect choices made to optimize common nucleotides 
and AG° n are unlikely to lead to runs. With less free- 
dom, it is harder to optimize diversity. We do see that 
while thefi plan is diversity-optimal for 0 and 4 muta- 
tions, it is not for 8 mutations, and the diversity-optimal 
plan spreads mutations out more. We also observe that 
the N-terminal region is so diverse that no run is pro- 
duced there even with 8 mutations. Diversity optimiza- 
tion thus tends to create runs that are more evenly 
distributed in the large C-terminal region. 

Figure 9 illustrates the effectiveness of allocated 
codons and mutations in terms of the optimizing the 
different objectives. We again see linear trends for the 
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four objectives (as discussed with GAR transformylases, 
diversity is not directly comparable over different library 
sizes). Thus even with these two extremely diverse par- 
ents, it is possible to select mutations according to a 
specified objective, and our algorithm does so. 

Figure 10 illustrates how well the different objectives 
do at producing runs, and how many nucleotides com- 
prise those runs. For the sake of space (and as with GAR 
transformylases), we only illustrate under a 9 nt threshold 
for a homologous region to count as a run, but we found 
exactly the same trends with other thresholds from 7 to 
12. Once again, explicitly optimizing for runs proves to 
be much more effective at producing runs (and more 
nucleotides in them) than does indirectly optimizing for 
runs by overall nucleotide identity or the nearest-neigh- 
bor approximation to change in free energy of annealing. 
With these diverse parents, the indirect objectives do not 
happen to produce many runs or nucleotides in runs; in 
fact, they produce almost no runs of longer lengths (e.g., 
12, not shown), even with 10 mutations. 

Even with such diverse parents, there is sufficient free- 
dom in the codon and mutation choices that different 
run-optimal plans yield different levels of chimera diver- 
sity. Figure 11 illustrates some such plans, at two muta- 
tion levels. The three plans at the same mutation level 
have the same number of runs but increasing diversity 
(and increasingly more even distributions of run) from 
top to bottom. 

Beta-lactamases 

Our third case study examines the effect of wild-type 
sequence identity. Beta-lactamases, which hydrolyze the 
beta-lactam found in certain antibiotics (e.g., penicillin), 
have been the object of much chimeragenesis work, 
including DNA shuffling [1] and site-directed methods 
[11]. We previously developed a multiple sequence 



alignment (272 residues and gaps) of diverse beta-lacta- 
mases [15]. For the present study, we considered (a) the 
common beta-lactamase targets TEM-1 (E. coli) and 
PSE-4 (Pseudomonas aeruginosa) (42% amino acid iden- 
tity); (b) the even more diverse pair from P. aeruginosa 
and Bacillus licheniformis (26% id); (c) the more similar 
pair from E. coli and Proteus mirabilis (47% id). 

Optimizing the wild-type amino acid sequences for 
common nucleotides yields DNA identity of (a) 70%, (b) 
61%, and (c) 73%. These numbers are somewhat border- 
line for standard DNA shuffling. They do result in some 
runs, though generally fewer than when directly optimiz- 
ing for runs, a trend that widens with more mutations. 
Optimizing for runs yields the same behavior as observed 
for the previous two cases; due to lack of space, we only 
present the free energy score and the number of nucleo- 
tides in runs under the f x metric (Figure 12). We again see 
the linear tread for both objectives with increasing muta- 
tions from 0 to 10. The actual energy score and run score 
both depend on parental sequence identity, with the same 
ranking on both metrics. 

Conclusion 

DNA shuffling is a staple of protein engineering, and we 
have demonstrated that our new algorithms can sub- 
stantially improve the expected productivity of an 
experiment. Even without performing any mutations, we 
are able to allocate codons to better form runs. By per- 
forming a small number of conservative substitutions, 
not expected to significantly affect stability or activity, 
we generally are able to increase the number of runs 
and the number of nucleotides in runs, linearly with the 
number of substitutions. Finally, since we are establish- 
ing runs whose lengths are sufficient to promote regular 
recombination, we can enhance our optimization to 
account for properties of the resulting chimeric library. 
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Figure 12 Beta-lactamase objective function trends. Shown are the objective function trends (left: AG° n ; right: f-, at 6 = 9) for beta- 
lactamase plans with 0 to 10 mutations, for pairs of beta-lactamase parents from (red) E. coli and P. aeruginosa, (blue) P. aeruginosa and B. 
licheniformis, and (magenta) E. coli and P. mirabilis. 



Future directions include extending run optimization to 
incorporate the type of potential underlying AG° n (i.e., 
accounting for differences in nucleotide content), to 
optimize multiple parents simultaneously, and to inte- 
grate CODNS within our Pareto-optimization frame- 
work [29] in order to optimize productivity of shuffling 
in concert with other properties. While both extensions 
will increase the computational expense, the resulting 
gain in experimental efficiency could be well worth it. In 
summary, our methods yield a new approach to DNA 



shuffling that supports substantially more diverse par- 
ents, is more deterministic, and generates more predict- 
able and more diverse chimeric libraries. 
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